Abstract

This paper proposes the problem of modeling video se- 
quences of dynamic swarms (DS). We define DS as a large 
layout of stochastically repetitive spatial configurations of 
dynamic objects (swarm elements) whose motions exhibit 
local spatiotemporal interdependency and stationarity, i.e., 
the motions are similar in any small spatiotemporal neigh- 
borhood. Examples of DS abound in nature, e.g., herds of 
animals and flocks of birds. To capture the local spatiotem- 
poral properties of the DS, we present a probabilistic model 
that learns both the spatial layout of swarm elements and 
their joint dynamics that are modeled as linear transforma- 
tions. To this end, a spatiotemporal neighborhood is asso- 
ciated with each swarm element, in which local stationarity 
is enforced both spatially and temporally. We assume that 
the prior on the swarm dynamics is distributed according to 
an MRF in both space and time. Embedding this model in 
a MAP framework, we iterate between learning the spatial 
layout of the swarm and its dynamics. We learn the swarm 
transformations using ICM, which iterates between estimat- 
ing these transformations and updating their distribution in 
the spatiotemporal neighborhoods. We demonstrate the va- 
lidity of our method by conducting experiments on real and 
synthetic video sequences. Real sequences of birds, geese, 
robot swarms, and pedestrians evaluate the applicability of 
our model to real world data. 

1. Introduction 

This paper is about modeling of video sequences of 
a dense collection of moving objects which we will call 
swarms. Examples of dynamic swarms (DS) in nature 
abound: a colony of ants, a herd of animals, people in a 
crowd, a flock of birds, a school of fish, a swarm of hon- 
eybees, trees in a storm, and snowfall. In artificial settings, 
dynamic swarms are illustrated by: fireworks, a caravan of 
vehicles, sailboats on a lake, and robot swarms. A DS is 
characterized by the following properties. (1) All swarm el- 
ements belong to the same category. This means that the 



appearances (i.e. geometric and photometric properties) of 
the elements are similar although not identical. For exam- 
ple, each element may be a sample from the same under- 
lying probability density function (pdf) of appearance pa- 
rameters. (2) The swarm elements occur in a dense spatial 
configuration. Thus, their spatial placement, although not 
regular, is statistically uniform, e.g., determined by a certain pdf. (3) Element motions are statistically similar. (4)
The motions of the swarm elements are globally indepen- 
dent. In other words, the motions of two elements that are 
sufficiently well separated are independent. However, this is 
not strictly true on a local scale because if they are located 
too close compared to the extents of their displacements, 
then their motions must be interdependent to preserve sep- 
aration. Thus, the motion parameters of each element vs. 
the other elements can be considered as being chosen from 
a mutually conditional pdf. Occasional variations in these
swarm properties are also possible, e.g., elements may belong to multiple categories such as different types of vehicles in traffic. Fig. 1 shows some examples of DS.




Figure 1. Examples of swarms.



This definition of DS is reminiscent of dynamic textures 
(DT). Indeed, a DS is analogous to a DT of complex non- 
point objects. The introduction of complex nonpoint objects 
introduces significant complexity: (1) Extraction of non- 
point objects becomes necessary, whose added complexity 
is evident from, e.g., the algorithm of [3]. (2) Motion for
nonpoint objects is richer than for point objects, e.g., rotation
and nonrigid transformations become feasible. Since most 
work on DTs has focused on textures formed of pixel or sub- 
pixel objects, DS is a relatively unexplored problem. Tools 
for DS analysis should be useful for general problems such 
as dynamic scene recognition, dynamic scene synthesis, and 
anomaly detection, as well as specific problems such as the
motion analysis of animal herds or flocks of birds. In this 
paper, we present an approach to derive the model of a DS 
from its video, and demonstrate its efficacy through exam- 
ple applications. Before we do this, we first review the work 
most related to DS, namely, that on DT. 

Related Work 

A DT sequence captures a random spatiotemporal phe- 
nomenon which may be the result of a variety of physi- 
cal processes, e.g., involving objects that are small (smoke 
particles) or large (snowflakes), or rigid (flag) or nonrigid 
(cloud, fire), moving in 2D or 3D, etc. Even though the 
overall global motion of a DT may be perceived by humans 
as being simple and coherent, the underlying local motion 
is governed by a complex stochastic model. Irrespective of 
the nature of the physical phenomena, the objective of DT 
modeling in computer vision and graphics is to capture the 
nondeterministic, spatial and temporal variation in images. 

As discussed earlier, although the basic notion of DTs al- 
lows that both spatial and temporal variations be complex, 
the limited work done on DTs has focused on moving objects (texels) that have little spatial complexity, even as they
exhibit complex motion. The texels are of negligible size 
(e.g., smoke particles), whose movement appears as a con- 
tinuous photometric variation in the image, rather than as a 
sparser arrangement of finite (nonzero) size texels. Conse- 
quently, the DT model must mainly capture the motion and 
less is needed to represent the spatial structure. 

Statistical modeling of spatiotemporal interdependence among DT images is closest to the work we present here. This work includes the spatiotemporal autoregressive (STAR) model by Szummer et al. [13] and multiresolution analysis (MRA) trees by Bar-Joseph et al. [5]. The DT model of Doretto et al. fVT\ uses a stable linear dynamical system (LDS). LDS mixture models have been developed in O and applied to DT clustering and segmentation. In |T^, a mixture of globally coordinated PPCA models was employed to model a DT.

Along with their merits, the previously proposed models also suffer from certain shortcomings. (i) These models make restrictive assumptions about the DT sequences. Most of them assume that a single DT covers each frame in the sequence. The others that consider multiple DTs are usually limited to particle textures (e.g., water and smoke). Consequently, these models cannot be



easily extended to dynamic swarms. Even if the texels were 
known beforehand, learning a separate model for each texel 
does not guarantee the underlying spatiotemporal stationar- 
ity of DS. (ii) They do not make a clear separation between 
the appearance and dynamical models of the DT. The approach proposed in [9] explicitly aims at this separation, but it is limited to fluid DTs only.

Another body of work related to our swarm motion model represents a DT as a set of dynamic textons (or motons) whose motion is governed by a Markov chain model [14, 16]. This generative model is limited to sequences of
particle objects (e.g. snowflakes) or objects imaged at large 
distances. The texton dynamics are constrained by the un- 
derlying assumptions of the model, which state that all tex- 
tons have the same frame-to-frame transformation, that this 
transformation is constant over time, and that the dynamics 
of spatially neighboring textons are independent. While this 
work does involve moving objects containing more than one 
pixel per object as well as some interpixel spacing, its mod- 
eling power still does not match the needs of the properties 
(1-4) of a DS given above. 

In the rest of this paper, we refer to the objects form- 
ing a swarm as swarm elements. We propose a probabilis- 
tic model that learns both the spatial layout of the swarm 
elements and their joint dynamics, modeled as linear trans- 
formations, which allow for a clear separation between the 
appearance and dynamics of these elements. This joint 
representation takes into account the interdependence in 
the properties of elements that are neighbors in space and 
time. This is done by enforcing stationarity only within 
spatiotemporal neighborhoods. This local stationarity con- 
straint allows us to model DS sequences that not only ex- 
hibit globally uniform dynamics (to which previous meth- 
ods are limited), but also sequences whose element proper- 
ties and dynamics gradually change, in space and time. 

Overview of Proposed Model 

Given a DS sequence in which swarm elements undergo 
locally stationary transformations, we iterate between learn- 
ing the spatial layout of these elements (i.e. their binary al- 
pha mattes and their frame-to-frame correspondences) and 
their dynamics. We estimate swarm dynamics such that they 
follow a probabilistic model that enforces local stationarity 
within a spatiotemporal neighborhood of each element. With regard to spatial layout, we assume that each swarm element consists of one or more homogeneous segments that
also possess these spatiotemporal stationarity properties. 

We model the frame-to-frame motion of each individual 
element as a linear transformation, which reconstructs the 
element's features in a given frame from its features in the 
previous one. These features can describe local or global 
properties. In our framework, we do not restrict the choice 
of these features, since they can be application dependent. 



These linear transformations are chosen to capture a wide 
variety of possible changes, especially rotation, scaling, and
shear. Moreover, a spatiotemporal neighborhood is associ- 
ated with each element, in which local stationarity is en- 
forced. Spatially, this is done by assuming that the dynam- 
ics of elements in a given neighborhood are samples from 
the same distribution corrupted by i.i.d. Gaussian noise. 
Temporally, these dynamics are governed by an autoregres- 
sive (AR) model. We learn swarm dynamics by estimating 
the transformations that maximize the a posteriori proba- 
bility or equivalently that (i) minimize the reconstruction 
error and (ii) enforce stationarity in each element's neigh- 
borhood. 

Contributions: (1) We present an approach that learns the 
dynamics of swarm elements jointly. This is done by mod- 
eling their frame-to-frame linear transformations instead of 
directly modeling their features. Using these transforma- 
tions, our model is able to handle more complex swarm mo- 
tions and allows for a clear separation between the appear- 
ance and dynamics of a swarm. (2) Based on our assump- 
tion of local spatiotemporal stationarity, the proposed prob- 
abilistic model allows for interdependence between swarm 
elements both in time and space. This is done locally, so as 
not to limit the types of DS sequences that can be modeled, 
which is a shortcoming of most other methods. (3) The pro- 
posed model and learning algorithm estimate the spatial lay- 
out of swarm elements by enforcing temporal coherence in 
determining their frame-to-frame correspondences and the 
spatial stationarity of their dynamics.

2. Proposed Spatiotemporal Model 

In this section, we give a detailed description of our spa- 
tiotemporal model for the spatial layout and dynamics of a 
DS. We consider sequences whose fundamental spatial ele- 
ments are opaque objects. The changes these elements un- 
dergo are stationary, both spatially and temporally. We also 
assume that each swarm element consists of one or more 
homogenous segments that also possess these spatiotempo- 
ral stationarity properties. To learn the spatial layout of a 
swarm, we refrain from using texel extraction algorithms (e.g., [3]) or multiple object trackers from the literature (e.g. ifTSll ). This is because they do not make use of the spatiotemporal relationship inherent to swarm elements. Instead, we revisit the video segmentation algorithm of [7],
which has some interesting properties that we exploit to 
learn spatial layout. Since no explicit tracking is performed 
on the swarm elements, occlusion handling remains a prob- 
lem and is left for future work. To enforce stationarity, we 
assume that the dynamics of the swarm elements are dis- 
tributed according to an MRF in both space and time. In our 
model, the dynamics of each swarm element is influenced 
by its spatial and temporal neighbors, within its spatiotem- 



poral neighborhood. Unlike other dynamical models (e.g. 
[12, 16]) that assume spatial independence between texture
elements, we maintain spatiotemporal dependence among 
swarm elements to render a more constrained model. In 
what follows, we give a clear mathematical formulation of 
our problem. 

We are given $F$ frames of size $M \times N$ constituting a swarm sequence. Frame $t$ in this sequence contains $K_t$ swarm elements. This allows elements to disappear and to be formed at different time instances. A swarm element
consists of one or more adjacent low-level image segments 
that have similar dynamics. Note that any low-level seg- 
mentation algorithm can be used here. In the following sec- 
tions, we show how we iterate between learning the spatial 
layout of the elements and their dynamics. At a given iter- 
ation, we fix element dynamics and update the swarm ele- 
ments by clustering segments to enforce spatiotemporal sta- 
tionarity. Then, we update the dynamics of the new swarm 
elements. 

Let us denote the swarm elements by their spatial layouts (i.e., binary alpha mattes) $\{T_t^{(i)}\}_{t=1,i=1}^{F,K_t}$, where $T_t^{(i)}$ is the manifestation of the $i$-th element in frame $t$ and $\mathcal{T}_t = \{T_t^{(i)}\}_{i=1}^{K_t}$ is the set of swarm elements in frame $t$. These swarm elements are represented by their $d$-dimensional feature vectors $\{f_t^{(i)}\}_{t=1,i=1}^{F,K_t}$, which describe their appearances. To model local swarm dynamics, we define a linear transformation $A_t^{(i)}$ that transforms $f_t^{(i)}$ into $f_{t+1}^{(i)}$. Due to its general form, it can encompass commonly used transformations (e.g., rotation and scaling) as well as more specific ones (e.g., any orthogonal or orthonormal transformation).

We use $\mathcal{A}_t = \{A_t^{(i)}\}_{i=1}^{K_t}$ to denote the set of transformations for the $K_t$ elements in frame $t$ and $\mathcal{F}_t = \{f_t^{(i)}\}_{i=1}^{K_t}$ to denote the set of features.

By using frame-to-frame transformations to characterize 
swarm dynamics instead of their corresponding features, we 
emphasize the separation between swarm appearance and 
dynamics. This is usually ignored in other models. This 
explicit separation allows distinction between and indepen- 
dent control of elements' appearance and motion. That is, 
we can pair any swarm elements with any dynamics. 

The goals of modeling these linear transformations are 
twofold. [G1] We desire accurate frame-to-frame reconstruction of the feature vectors, which determines how well
our model fits the underlying data. [G2] We need to im- 
pose spatial and temporal stationarity on the transforma- 
tions within a local spatiotemporal neighborhood. In the 
absence of [G2], our model is ill-posed and too general 
for any practical use. Consequently, [G2] ensures that our 
model conforms to the underlying process that generates the 



swarm elements' dynamics. 

Section 2.1 gives a detailed description of how a swarm element's spatiotemporal neighborhood is formed. In Section 2.3, we learn the spatial layout and the linear transformations in a probabilistic MAP framework.

2.1. Spatiotemporal Neighborhood in a DS 

Our dynamical model assumes spatial and temporal stationarity for each swarm element within its spatiotemporal neighborhood. Let $\mathcal{C} = \{\mathcal{N}_t^{(i)}\}_{t=1,i=1}^{F,K_t}$ be the set of all spatiotemporal neighborhoods in the sequence. $\mathcal{N}_t^{(i)}$ is the set of elements included in the neighborhood of $T_t^{(i)}$. We define $\Gamma(t,i)$ to be the set of index pairs $(u,v)$ that represent $T_u^{(v)}$ in $\mathcal{N}_t^{(i)}$. For simplicity, we decompose $\Gamma(t,i)$ into two disjoint sets of indices, $\Gamma_S(t,i)$ and $\Gamma_T(t,i)$, where $\Gamma_S(t,i) = \{(t,j) : T_t^{(j)} \in \mathcal{N}_t^{(i)}\}$ and $\Gamma_T(t,i) = \{(s,i) : s < t,\ T_s^{(i)} \in \mathcal{N}_t^{(i)}\}$. $\Gamma_S(t,i)$ defines the spatial neighbors of $T_t^{(i)}$, while $\Gamma_T(t,i)$ defines its temporal neighbors.



Spatial Neighborhood 

The elements indexed by $\Gamma_S(t,i)$ are determined by the generalized Voronoi regions corresponding to the elements present in the frame. We also weigh the "neighborness" of every pair of spatial neighbors: $w_t(i,j)$ is the weight for the pair $(T_t^{(i)}, T_t^{(j)})$. It is equal to the ratio of the length of the common boundary between the Voronoi regions of the neighboring elements to the average distance of these elements to the common boundary. For elements that are not spatial neighbors, this weight is set to zero. Local spatial stationarity is enforced by assuming that transformations of neighboring elements are drawn from the same distribution, corrupted by i.i.d. Gaussian noise. Therefore, we have:

$$\forall (t,j) \in \Gamma_S(t,i): \quad A_t^{(i)} = A_t^{(j)} + N,$$

where $N$ is a noise matrix with i.i.d. Gaussian entries.
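The weight described above can be illustrated with a minimal sketch. It assumes that the Voronoi geometry (common boundary length and the two element-to-boundary distances) has already been computed; the function name and its arguments are hypothetical and are not taken from the paper.

```python
def neighborness_weight(boundary_length, dist_i, dist_j):
    """Weight of a pair of spatial neighbors (illustrative helper).

    boundary_length: length of the common boundary between the two
        generalized Voronoi regions (0 if the elements are not neighbors).
    dist_i, dist_j: distances of the two elements to that common boundary.
    """
    if boundary_length == 0:
        return 0.0  # elements that are not spatial neighbors get zero weight
    mean_dist = 0.5 * (dist_i + dist_j)
    return boundary_length / mean_dist


# Same shared boundary, but the second pair lies farther from it.
w_near = neighborness_weight(12.0, 2.0, 4.0)
w_far = neighborness_weight(12.0, 8.0, 16.0)
w_none = neighborness_weight(0.0, 5.0, 5.0)
```

Under this definition, a long shared boundary and a short distance to it both increase the weight, so close, strongly adjacent elements influence each other the most.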



Temporal Neighborhood 

The elements indexed by $\Gamma_T(t,i)$ are the manifestations of the $i$-th element in a temporal window consisting of the $W_T$ previous frames. The limits of this window are truncated to remain within the limits of the video sequence itself. This is done to resolve exceptions occurring at the first $W_T$ frames in the sequence. We enforce temporal stationarity by applying an AR model of order $W_T$ to the sequence of transformations in this temporal window. In fact, the AR model has often been used to model features over time (e.g., [14]), but here, we use it to model the temporal variations of these features (i.e., the dynamics themselves). Therefore, we have:

$$A_t^{(i)} = \sum_{j=1}^{P_t} a_j A_{t-j}^{(i)} + N,$$

where $P_t = \min(W_T, t-1)$ and $N$ is a noise matrix with i.i.d. Gaussian entries. For simplicity, the AR coefficients ($a \in \mathbb{R}^{W_T}$) are assumed to be time invariant and constant for all swarm elements.
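The temporal model can be sketched as follows. This is an illustrative, noise-free AR prediction with the window truncated at the start of the sequence, i.e. it uses $\min(W_T, t-1)$ past transformations; the function name, the list-of-lists matrix format, and the example values are assumptions made for the sketch.

```python
def ar_predict(history, a, w_t):
    """Predict the current transformation from its temporal neighbors.

    history: past d x d transformations (lists of lists), ordered oldest
        to newest, so history[-1] is the transformation of frame t-1.
    a: AR coefficients; a[0] multiplies frame t-1, a[1] frame t-2, ...
    w_t: temporal window size (W_T in the text).
    """
    if not history:
        raise ValueError("at least one past transformation is required")
    p = min(w_t, len(history))  # truncated window near the start of the video
    d = len(history[-1])
    pred = [[0.0] * d for _ in range(d)]
    for j in range(1, p + 1):
        past = history[-j]
        for r in range(d):
            for c in range(d):
                pred[r][c] += a[j - 1] * past[r][c]
    return pred


identity = [[1.0, 0.0], [0.0, 1.0]]
rot90 = [[0.0, -1.0], [1.0, 0.0]]
# Only two past frames exist, so the order-3 window is truncated to 2.
pred = ar_predict([identity, rot90], a=[0.5, 0.5, 0.0], w_t=3)
```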

In Figure 2, we show an example of the spatiotemporal neighborhood of a swarm element with $W_T = 2$. Note that the number of spatial neighbors and the "neighborness" weights can change from frame to frame.

2.2. Model for Swarm Dynamics and Spatial Layout 

Here, we present the probabilistic model that governs 
the dynamics of swarm elements and their spatial layout in 
a DS. We model the joint probability of the spatial layout 
of the swarm elements, their features, and their dynamics. 
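This is done by decomposing the joint into the prior over the transformations and the spatial layout, in addition to the likelihood of the features given the swarm layout and dynamics, as in Eq. (1). In what follows, we model the three terms to ensure [G1] and [G2].

$$P\left(\{\mathcal{A}_t\}_{t=1}^{F-1}, \{\mathcal{F}_t\}_{t=1}^{F}, \{\mathcal{T}_t\}_{t=1}^{F}\right) = \mathcal{L}\,\mathcal{P}_T\,\mathcal{P}_A \qquad (1)$$

where $\mathcal{L} = P\left(\{\mathcal{F}_t\}_{t=1}^{F} \mid \{\mathcal{A}_t\}_{t=1}^{F-1}, \{\mathcal{T}_t\}_{t=1}^{F}\right)$ is the likelihood, and $\mathcal{P}_T$ and $\mathcal{P}_A$ are the priors on the spatial layout and on the dynamics, respectively.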
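Likelihood Model ($\mathcal{L}$)

Since we assume a linear relationship between consecutive feature vectors, we can decompose the likelihood probability as:

$$\mathcal{L} = P_1 \prod_{t=1}^{F-1} \prod_{i=1}^{K_t} p\left(f_{t+1}^{(i)} \mid f_t^{(i)}, A_t^{(i)}, \mathcal{T}_t\right),$$

where $p\left(f_{t+1}^{(i)} \mid f_t^{(i)}, A_t^{(i)}, \mathcal{T}_t\right) = \mathcal{N}\left(A_t^{(i)} f_t^{(i)}, \gamma_t^2 I_d\right)$ and $P_1 = p\left(\mathcal{F}_1 \mid \{\mathcal{A}_t\}_{t=1}^{F-1}, \{\mathcal{T}_t\}_{t=1}^{F}\right)$ is a constant with respect to the transformations. Consequently, we can write the negative log likelihood as in Eq. (2).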
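As a worked step, the negative log likelihood can be written out under the assumption that each conditional is an isotropic Gaussian with mean $A_t^{(i)} f_t^{(i)}$ and covariance $\gamma_t^2 I_d$, where $f_t^{(i)}$ is the $d$-dimensional feature vector of element $i$ in frame $t$. The expression below is a sketch of the form that follows from this assumption; it is not claimed to reproduce the paper's Eq. (2) verbatim.

```latex
-\ln \mathcal{L}
  = -\ln P_1
  + \sum_{t=1}^{F-1} \sum_{i=1}^{K_t}
    \left[
      \frac{d}{2} \ln\left( 2\pi\gamma_t^2 \right)
      + \frac{1}{2\gamma_t^2}
        \left\| f_{t+1}^{(i)} - A_t^{(i)} f_t^{(i)} \right\|_2^2
    \right]
```

The first term in the brackets depends only on the noise level, while the second is the frame-to-frame reconstruction error that goal [G1] asks to be small.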
Prior on Swarm Spatial Layout ($\mathcal{P}_T$)

As stated before, each swarm element consists of one or 
more homogenous segments that are produced by the al- 
gorithm of m. The spatial layout of these elements and 
their frame-to-frame correspondences must ensure that the 
swarm elements' features are reconstructed faithfully and 
that spatial stationarity of their dynamics is enforced. The 




Figure 2. Panels show frames $(t-2)$, $(t-1)$, and $(t)$. Spatial neighbors are connected by solid black lines, while temporal neighbors are connected by dashed black lines. Here, the element in frame $t$ has two spatial neighbors and two temporal neighbors (its manifestations in frames $t-1$ and $t-2$), which comprise its spatiotemporal neighborhood.
frame-to-frame correspondences of a swarm element are equivalent to many-to-many correspondences between segments from the two frames. To formalize this problem, we denote the frame-to-frame correspondence between $T_t^{(i)}$ and $T_{t+1}^{(i)}$ as $n_{(t,i)}$, which is a node in the graph of all frame-to-frame correspondences in the swarm sequence. Two nodes $n_{(t,i)}$ and $n_{(s,j)}$ are considered neighbors in the graph if any pair from $\{T_t^{(i)}, T_{t+1}^{(i)}\}$ and $\{T_s^{(j)}, T_{s+1}^{(j)}\}$ is spatially adjacent (i.e., shares boundaries). We show an example in Figure 3.




Figure 3. Two neighboring nodes of swarm elements in frames $t$ and $t+1$. Note that $n_{(s,j)}$ consists of two regions.

Here, we can define a self-similarity function for each node, $s_1(n_{(t,i)})$, that quantifies the quality of frame-to-frame feature reconstruction. Also, we define a pairwise similarity function for each pair of neighboring nodes, $s_2(n_{(t,i)}, n_{(s,j)})$, that evaluates how similar their frame-to-frame transformations are. This setup is similar to the one used in [7]. In fact, we shall see later that we use a similar method to update the spatial layout. We use normalized correlation to define $s_1(\cdot)$ and $s_2(\cdot)$. The prior $\mathcal{P}_T$ is proportional to the self and pairwise similarities of all neighboring nodes in the graph.
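The two similarity functions can be sketched as below. This is an illustration only: it assumes an uncentered normalized correlation, compares the reconstructed and observed feature vectors for the self-similarity, and compares vectorized transformations for the pairwise similarity. The exact definitions used in the paper are not recoverable here, and all function names are hypothetical.

```python
import math


def normalized_correlation(u, v):
    """Uncentered normalized correlation between two flat vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def mat_vec(A, f):
    """Apply a transformation (list of rows) to a feature vector."""
    return [sum(a * x for a, x in zip(row, f)) for row in A]


def flatten(A):
    return [x for row in A for x in row]


def self_similarity(A, f_prev, f_next):
    """s1: how well A reconstructs the next-frame features."""
    return normalized_correlation(mat_vec(A, f_prev), f_next)


def pairwise_similarity(A_i, A_j):
    """s2: how similar the transformations of two neighboring nodes are."""
    return normalized_correlation(flatten(A_i), flatten(A_j))


rot90 = [[0.0, -1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
s1 = self_similarity(rot90, [1.0, 0.0], [0.0, 2.0])  # perfect up to scale
s2 = pairwise_similarity(rot90, identity)            # orthogonal transformations
```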



Prior on Swarm Dynamics ($\mathcal{P}_A$)

As $\mathcal{L}$ was modeled to guarantee [G1], [G2] is accounted for by modeling $\mathcal{P}_A$ as a product of potential functions defined on the set of all spatiotemporal neighborhoods. This decomposition is widely used to model priors on maximal cliques defined on an undirected graph. We define the potential function for each clique as the product of a spatial potential $\Psi_S(\cdot)$ and a temporal potential $\Psi_T(\cdot)$, which guarantee spatial and temporal stationarity in swarm dynamics, respectively. So, $\mathcal{P}_A$ is, up to the normalizing factor $Z$, the product of $\Psi_S(\cdot)\,\Psi_T(\cdot)$ over all spatiotemporal neighborhoods, where $\Psi_S$ and $\Psi_T$ are built from potentials $\psi_S$ and $\psi_T$ that evaluate how spatially and temporally stationary the swarm transformations are. For simplicity, we set $\psi_S\left(A_t^{(i)}, A_t^{(j)}\right) = p\left(A_t^{(i)} \mid A_t^{(j)}\right)$ and $\psi_T\left(A_t^{(i)}\right) = p\left(A_t^{(i)} \mid \{A_{t-j}^{(i)}\}_{j=1}^{P_t}\right)$. We can express the negative log prior as in Eq. (3). Note that $P_2$ is a constant that depends on the "neighborness" weights, while $C_S$ and $C_T$ are constants determined by the sizes of the spatial and temporal neighborhoods, $|\Gamma_S(t,i)|$ and $|\Gamma_T(t,i)|$. Also, we assume that the normalizing factor $Z$ is constant with respect to the swarm dynamics, the noise variances, and the AR coefficients.



$$-\ln \mathcal{P}_A = \ln Z + \ln P_2 + C_S \left[\ln \sigma_S + \cdots\right] + C_T \left[\ln \sigma_T + \cdots\right] \qquad (3)$$
2.3. Learning Swarm Layout and Dynamics

After establishing our probabilistic model, we proceed to learning its parameters: $\{\mathcal{T}_t\}_{t=1}^{F}$, $\{\mathcal{A}_t\}_{t=1}^{F-1}$, the noise variances $\sigma_S$, $\sigma_T$, and $\gamma$ (i.e., $\{\gamma_t\}_{t=1}^{F-1}$), as well as the AR coefficients $a$ (i.e., $\{a_j\}_{j=1}^{W_T}$). To do this, we embed our model into a MAP framework. We assume that the prior on the features and the prior on the noise variances are uniform. Substituting Eqs. (2) and (3) into Eq. (1), we formulate the MAP problem as a nonlinear and non-convex minimization problem.

$$\min \left[ -\ln \mathcal{L} - \ln \mathcal{P}_A - \ln \mathcal{P}_T \right] \qquad (4)$$



Due to the complex form of Eq. (4), we learn the spatial layout of the DS and its dynamics in an iterative fashion. In each iteration, we either fix the dynamics and update the spatial layout or vice versa. In what follows, we show the steps involved in updating the spatial layout and the dynamics at the $j$-th iteration.

Spatial Layout Update 

We employ a method similar to the one used for video object segmentation in [7] to update $\{\mathcal{T}_t[j-1]\}_{t=1}^{F}$. We will only highlight the main aspects of this method and how it applies to modeling DSs. We create a graph whose nodes are all candidates for frame-to-frame correspondences between $\{\mathcal{T}_t[j-1]\}_{t=1}^{F}$ and individual segments of these frames. In other words, a segment or swarm element in frame $t$ corresponds to a segment or swarm element in the next frame if the projection of the former into frame $t+1$ (according to its optical flow) overlaps with the latter. This graph allows for the clustering of similar and neighboring nodes, thus enabling many-to-many correspondences between consecutive frames. Once this graph is created, the attributes of each node and the edge weights between neighboring nodes are determined by $s_1(\cdot)$ and $s_2(\cdot)$, as defined in Section 2.2. For segments that do not belong to $\{\mathcal{T}_t[j-1]\}_{t=1}^{F}$, we use identity for their transformation. Given this weighted undirected graph, we cluster its nodes into valid and invalid correspondences. This binary clustering is done using graph cuts, instead of relaxation labeling. Then, the resulting valid correspondences are broken down into individual connected components, where connectedness is over time and space. This yields $\{\mathcal{T}_t[j]\}_{t=1}^{F}$. As pointed out in [7], this method tends to cluster adjacent/occluding swarm elements with similar dynamics. For initialization, we set $\{\mathcal{T}_t[0]\}_{t=1}^{F}$ to all segments in the video sequence with non-zero optical flow.
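The candidate-correspondence step (project a segment into the next frame and test for overlap) can be sketched as follows. The sketch makes simplifying assumptions that are not in the paper: each segment is a set of integer pixel coordinates, and it moves with a single integer flow vector rather than a per-pixel optical flow field. Function names are hypothetical.

```python
def project(segment, flow):
    """Shift every pixel (x, y) of a segment by its flow vector (dx, dy)."""
    dx, dy = flow
    return {(x + dx, y + dy) for (x, y) in segment}


def candidate_correspondences(segments_t, flows_t, segments_next):
    """Return index pairs (i, j) such that segment i of frame t, projected
    into frame t+1, overlaps segment j of frame t+1."""
    pairs = []
    for i, (seg, flow) in enumerate(zip(segments_t, flows_t)):
        projected = project(seg, flow)
        for j, seg_next in enumerate(segments_next):
            if projected & seg_next:  # non-empty pixel overlap
                pairs.append((i, j))
    return pairs


segments_t = [{(0, 0), (1, 0)}, {(5, 5)}]
flows_t = [(2, 0), (0, 1)]
segments_next = [{(2, 0)}, {(3, 0), (4, 0)}, {(9, 9)}]
pairs = candidate_correspondences(segments_t, flows_t, segments_next)
```

In this toy example the first segment overlaps two segments of the next frame, which is the kind of one-to-many candidate that the graph clustering later resolves into many-to-many correspondences.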

Dynamics Update 

Given $\{\mathcal{T}_t[j]\}_{t=1}^{F}$, Eq. (4) can be solved iteratively using Iterated Conditional Modes (ICM) [6], which guarantees convergence to a local minimum. In the $k$-th ICM iteration, the variances are updated to their ML estimates. Updating each $A_t^{(i)}$ requires the minimization of a convex quadratic matrix problem, while $a$ is updated by solving a linear system of equations. In what follows, we index the model parameters with $[k]$ to denote their estimates in the $k$-th ICM iteration.

First, we show the update equation for the AR coefficients. Taking the gradient of Eq. (4) with respect to $a$ and setting it to zero renders the following update equation: $M a[k] = m$. Here, $M$ is the sum of Gram matrices corresponding to the transformations associated with the spatiotemporal neighborhoods at iteration $k$, and $m$ is the sum of the inner products between these transformations.
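The structure of this update can be sketched for a simplified case. The code below assumes a single swarm element, ignores the noise variances and "neighborness" weights, and uses the Frobenius inner product between transformations to build the Gram matrix $M$ and the vector $m$; the paper sums these quantities over all spatiotemporal neighborhoods. All names are hypothetical.

```python
def frob_inner(A, B):
    """Frobenius inner product of two matrices given as lists of rows."""
    return sum(x * y for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def solve(M, m):
    """Solve M a = m by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [row[:] + [rhs] for row, rhs in zip(M, m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    a = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * a[c] for c in range(r + 1, n))
        a[r] = (aug[r][n] - s) / aug[r][r]
    return a


def update_ar_coefficients(transforms, order):
    """Least-squares AR coefficients for one element's transformations."""
    M = [[0.0] * order for _ in range(order)]
    m = [0.0] * order
    for t in range(order, len(transforms)):
        past = [transforms[t - j] for j in range(1, order + 1)]
        for p in range(order):
            m[p] += frob_inner(transforms[t], past[p])
            for q in range(order):
                M[p][q] += frob_inner(past[p], past[q])
    return solve(M, m)


def combine(coeffs, mats):
    """Weighted sum of matrices, used only to build the toy sequence."""
    d = len(mats[0])
    return [[sum(c * Mx[r][k] for c, Mx in zip(coeffs, mats)) for k in range(d)]
            for r in range(d)]


# Toy sequence generated exactly by an AR(2) model with known coefficients.
seq = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
true_a = [0.5, 0.25]
for _ in range(4):
    seq.append(combine(true_a, [seq[-1], seq[-2]]))
a_hat = update_ar_coefficients(seq, order=2)
```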

Now, we turn to updating the transformations. At each ICM iteration, we fix all of them except for $X = A_t^{(i)}[k]$. Here, we isolate the dependence of Eq. (4) on $X$ and minimize the resulting convex quadratic matrix problem, Eq. (5), whose objective we denote by $g(X)$. In this problem, $e_R$ and $e_S$ represent the reconstruction and spatial stationarity residuals, respectively, and $e_T$ represents the temporal stationarity residuals corresponding to the frames preceding frame $t$. We express these terms as follows.
$$a_l X - A_{t+l}^{(i)}[k] + \sum_{j \neq l} a_j A_{t+l-j}^{(i)}[k]$$



Minimizing $g(X)$ is a convex quadratic problem that admits a global minimum $X^*$. It can be obtained using gradient descent, where the rate of descent ($\eta$) is determined by a line search. A closed form solution for $\eta$ can be derived. Until now, $X$ has been an unconstrained linear transformation; however, certain applications require that it belong to a feasible set $\mathbb{S}_d$ (e.g., rotation or symmetric matrices).


To enforce this constraint, we project the intermediate solution at each descent step onto $\mathbb{S}_d$. In some cases, this projection is trivial. For example, if $\mathbb{S}_d = \{X \in \mathbb{R}^{d \times d} : X = X^T\}$, the projection of $X$ is $\frac{1}{2}(X + X^T)$. Using differential matrix identities, we can express the gradient of $g(X)$ in a computationally efficient form: $\nabla g(X) = X\left(\beta I_d + bb^T\right) - D$, where $\beta$, $b$, and $D$ depend, among other quantities, on the current estimates of the transformations and $a$. Algorithm 1 provides details for solving Eq. (5).
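The descent procedure can be sketched as follows. The sketch assumes that the objective is scaled so that its gradient is exactly $X(\beta I_d + bb^T) - D$ with $\beta > 0$, in which case the exact line-search step along the gradient $G$ is $\eta = \|G\|_F^2 / (\beta \|G\|_F^2 + \|Gb\|_2^2)$. The stopping rule (change in the iterate), the function names, and the toy values are assumptions; the optional projection is applied after each step, and its convergence is not verified here.

```python
def gradient(X, beta, b, D):
    """grad g(X) = X (beta * I + b b^T) - D."""
    d = len(b)
    Xb = [sum(X[r][k] * b[k] for k in range(d)) for r in range(d)]
    return [[beta * X[r][c] + Xb[r] * b[c] - D[r][c] for c in range(d)]
            for r in range(d)]


def project_symmetric(X):
    """Frobenius-norm projection onto symmetric matrices: (X + X^T) / 2."""
    d = len(X)
    return [[0.5 * (X[r][c] + X[c][r]) for c in range(d)] for r in range(d)]


def gradient_descent(X, beta, b, D, eps=1e-12, max_iter=1000, project=None):
    """Steepest descent with an exact line search for the quadratic g."""
    d = len(b)
    for _ in range(max_iter):
        G = gradient(X, beta, b, D)
        gg = sum(v * v for row in G for v in row)  # ||G||_F^2
        if gg == 0.0:
            break
        Gb = [sum(G[r][k] * b[k] for k in range(d)) for r in range(d)]
        eta = gg / (beta * gg + sum(v * v for v in Gb))  # exact step size
        X_new = [[X[r][c] - eta * G[r][c] for c in range(d)] for r in range(d)]
        if project is not None:
            X_new = project(X_new)  # optional projection onto the feasible set
        delta = sum((X_new[r][c] - X[r][c]) ** 2
                    for r in range(d) for c in range(d)) ** 0.5
        X = X_new
        if delta <= eps:
            break
    return X


def closed_form(beta, b, D):
    """Unconstrained minimizer D (beta*I + b b^T)^{-1}, via Sherman-Morrison."""
    d = len(b)
    Db = [sum(D[r][k] * b[k] for k in range(d)) for r in range(d)]
    scale = 1.0 / (beta + sum(v * v for v in b))
    return [[(D[r][c] - scale * Db[r] * b[c]) / beta for c in range(d)]
            for r in range(d)]


beta, b = 2.0, [1.0, 2.0]
D = [[1.0, 0.0], [3.0, 4.0]]
X_gd = gradient_descent([[0.0, 0.0], [0.0, 0.0]], beta, b, D)
X_cf = closed_form(beta, b, D)
sym = project_symmetric([[1.0, 2.0], [4.0, 3.0]])
```

For the unconstrained case, the iterative and closed-form solutions should agree, which gives a simple sanity check on the gradient expression.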

We can initialize $X$ in two ways. (a) Set $X_{(0)}$ equal to the transformation obtained from the previous ICM iteration (i.e., $X_{(0)} = A_t^{(i)}[k-1]$). (b) If $X$ is constrained to be in $\mathbb{S}_d$, we can initialize $X_{(0)}$ by projecting the solution to the unconstrained version of Eq. (5), denoted $X_{unc}$, onto $\mathbb{S}_d$. Setting $\nabla g(X) = 0$ and using the matrix inversion lemma, we get $X_{unc} = \frac{1}{\beta} D \left( I_d - \frac{bb^T}{\beta + b^T b} \right)$. In our experiments, both initialization schemes had similar rates of convergence; however, (b) tends to be more numerically unstable when $\beta$ is small. For the first ICM iteration ($k = 0$), we initialize every $A_t^{(i)}[0] = I_d$. Numerically, we avoid division by zero by setting $\sigma_S[0] = \sigma_T[0] = \gamma_t[0] = 1$.



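Algorithm 1: Gradient Descent (GD)

Input: $X_{(0)} \in \mathbb{S}_d$, $\beta$, $b$, $D$, $\epsilon$
Initialization: $\delta \leftarrow \infty$; $\ell = 0$
while $\delta > \epsilon$ do
    $\eta_\ell = \arg\min_{\eta \geq 0} g\left(X_{(\ell)} - \eta \nabla g(X_{(\ell)})\right)$
    $X_{(\ell+1)} = X_{(\ell)} - \eta_\ell \nabla g(X_{(\ell)})$
    $X_{(\ell+1)} = P_{\mathbb{S}_d}\left(X_{(\ell+1)}\right)$ (optional)
    $\delta = \left\| X_{(\ell+1)} - X_{(\ell)} \right\|_F$; $\ell \leftarrow \ell + 1$
end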



Algorithm 2 combines all these update equations together into the overall algorithm for solving Eq. (4) to learn
the swarm spatial layout and dynamics. The worst case 
complexity of this algorithm is 0{Fd^), since it is defined 
by the complexity of Algorithm 1, which has a linear convergence rate.

3. Experimental Results 

To validate our model and evaluate the performance of 
our algorithm, we conducted experiments on synthetic sequences (Section 3.1) and real sequences (Section 3.2).
The synthetic sequences help provide quantitative evalua- 
tion. The experiments show that we can learn the dynam- 
ics of swarms and discriminate between different types of 
swarm motion. 



Algorithm 2: Learn Swarm Layout and Dynamics

Input: $\{\mathcal{F}_t, \mathcal{T}_t[0], \mathcal{A}_t[0]\}_{t=1}^{F}$, $W_T$, $\epsilon$, $j_{\max}$, $k_{\max}$
for $j \leftarrow 0$ TO $j_{\max}$ do
    // update spatial layout
    get $\{\mathcal{T}_t[j+1]\}_{t=1}^{F}$ from $\{\mathcal{T}_t[j], \mathcal{A}_t[j]\}_{t=1}^{F}$
    for $t \leftarrow 1$ TO $F$; $i \leftarrow 1$ TO $K_t$ do
        find generalized Voronoi regions of $T_t^{(i)}$
        compute the "neighborness" weights of $T_t^{(i)}$
    end
    // update noise variances and transformations
    Initialization: $\delta \leftarrow \infty$; $k = 0$
    while ($\delta > \epsilon$) AND ($k < k_{\max}$) do
        compute $\sigma_S[k]$, $\sigma_T[k]$, $\gamma[k]$, $a[k]$
        for $t \leftarrow 1$ TO $F$; $i \leftarrow 1$ TO $K_t$ do
            compute $\beta$, $b$, $D$, $X_{(0)}$
            $B_t^{(i)}[k+1] = \mathrm{GD}\left(X_{(0)}, \beta, b, D, \epsilon\right)$
        end
        update $\delta$; $k \leftarrow k + 1$
        $A_t^{(i)}[j] = B_t^{(i)}[k+1] \;\; \forall t, i$
    end
end







3.1. Synthetic Sequences 

Model Learning: First, we construct a synthetic DS sequence of $F = 25$ frames and $K = 8$ elements (4 leaves and 4 squares with a simple textured interior). Figure 4(a) shows a sample frame of this sequence, where the boundaries of the generalized Voronoi regions are drawn in green. The motion of the swarm elements is synthesized by applying a globally similar rotation $R_{\theta_t^{(i)}}$. Specifically, for each element in every frame, $\theta_t^{(i)}$ is sampled from a Gaussian distribution with mean $\theta_0$ and standard deviation $\sigma$.

The features we used were based on a polar coordinate system centered at the centroid of each element, where each angular bin had a width of $\frac{\pi}{20}$ rad. For each angular bin, we extracted two shape features (kurtosis and skew), the mean centroidal distance of the element boundary, and the mean intensity value. This yielded a feature vector of size $d = 160$ (4 features for each of the 40 bins). Setting the tolerance $\epsilon$, $k_{\max} = 50$, and $W_T = 3$, we applied Algorithm 2 to learn the swarm dynamics. Running MATLAB on a 2.4GHz PC, our algorithm converged in 40 ICM iterations (~30 seconds). Figure 4(b) shows a sample



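transformation matrix after convergence. We evaluate our model fitting performance by using three measures: the reconstruction residual error $\zeta_R(t)$, the spatial residual error $\zeta_S(t)$, and the temporal residual error $\zeta_T(t)$, defined as:

$$\zeta_R(t) = \frac{1}{K_t} \sum_{i=1}^{K_t} \frac{\|e_R\|_2}{\|f_{t+1}^{(i)}\|_2}, \qquad \zeta_S(t) = \frac{1}{K_t} \sum_{i=1}^{K_t} \frac{\|e_S\|_F}{|\Gamma_S(t,i)| \, \|A_t^{(i)}\|_F}, \qquad \zeta_T(t) = \frac{1}{K_t} \sum_{i=1}^{K_t} \frac{\|e_T\|_F}{|\Gamma_T(t,i)| \, \|A_t^{(i)}\|_F}$$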

These measures quantify the average error incurred in reconstructing
the data and in enforcing stationarity in the spatiotemporal
neighborhood of each swarm element. The smaller these measures are,
the better our model fits the data. Figure 4(c) plots these measures
for all frames in the sequence. All three measures vary stably with
time. $\epsilon_S$ and $\epsilon_T$ are consistently larger than
$\epsilon_R$ due to the added noise corrupting each transformation. In
fact, as the noise level $\sigma \to 0$, $\epsilon_S$ and $\epsilon_T$
both get closer to $\epsilon_R$. Furthermore, $\epsilon_T$ is
consistently larger than $\epsilon_S$ because temporal neighborhoods
only extend $W_t = 3$ frames from each swarm element. In fact, as
$W_t \to (F - 1)$, $\epsilon_T$ gets closer to $\epsilon_S$, since
temporal stationarity is enforced on a larger number of frames. Here,
we point out that although the leaf and square elements are
significantly different in appearance, their dynamics are the same.
This supports the claim that our method successfully separates swarm
appearance from swarm dynamics.

Figure 4. (a) A frame in the synthetic sequence. (b) A sample learned
transformation after convergence. (c) Modeling performance. All the
video results are provided in the supplementary material.



Motion Discrimination: Here, we demonstrate that the learned
transformations can discriminate between different types of motion.
Another synthetic DS sequence is constructed in the same manner as
before, but with the leaf and square elements now rotating in opposite
directions. Leaf elements undergo the rotation
$\mathcal{R}(\theta_t^{(i)})$, while square elements undergo
$\mathcal{R}(-\theta_t^{(i)})$. After learning the swarm dynamics, we
compute all the distances (i.e., the Frobenius norm of the difference)
between pairs of learned transformations. We show the resulting
distance matrix in Figure 5(a). We see that the transformations
corresponding to the leaf elements are close to each other and far
from those corresponding to the square elements. For visualization
purposes, we perform multidimensional scaling (MDS) on these pairwise
distances to embed the transformations in a low-dimensional Euclidean
space. In this space, the leaf and square dynamics are easily
separable. Moreover, these transformations can be perfectly clustered
using spectral clustering ($K = 2$ clusters).

This result supports the claim that our method can learn and
discriminate between different motions occurring within a single DS
sequence. This conclusion is valid as long as the "neighborness"
weights between swarm elements undergoing similar dynamics are
reasonably higher than those between elements moving differently.
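
The distance-matrix step described above can be illustrated with a
small sketch. It is not the implementation used in the experiments:
the 2-D rotation matrices stand in for learned transformations, and
the names `rotation`, `frobenius_distance`, and `distance_matrix`, as
well as the angle and jitter values, are assumptions made only for
illustration.

```python
import math

def rotation(theta):
    """2-D rotation matrix as nested lists (stand-in for a learned transformation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def frobenius_distance(A, B):
    """Frobenius norm of the difference A - B."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def distance_matrix(transforms):
    """Symmetric matrix of pairwise Frobenius distances."""
    n = len(transforms)
    return [[frobenius_distance(transforms[i], transforms[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical data: four "leaf" elements rotate by roughly +theta,
# four "square" elements by roughly -theta.
theta = 0.3
jitter = [-0.02, -0.01, 0.01, 0.02]
leaves = [rotation(theta + e) for e in jitter]
squares = [rotation(-theta + e) for e in jitter]
D = distance_matrix(leaves + squares)

# Largest distance inside the leaf group vs. smallest distance across groups.
within = max(D[i][j] for i in range(4) for j in range(4))
across = min(D[i][j] for i in range(4) for j in range(4, 8))
```

In this toy setting `within` is much smaller than `across`, which is
the block structure one would expect to see in a distance matrix such
as the one in Figure 5(a).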




Figure 5. (a) Distance matrix: the distances between the swarm
transformations in the synthetic sequence; brighter values designate
larger distances. (b) MDS of swarm dynamics: the transformations
projected onto a low-dimensional space using MDS.



3.2. Real Sequences 

In this section, we present experimental results produced when
Algorithm 2 is applied to real sequences where single or multiple
elements are undergoing an underlying dynamic swarm motion.



3.2.1 Single Swarm Element Sequences 

Here, we apply our algorithm to human action recognition, where we
consider the human as a single texel, so there is no need to determine
the spatial neighborhoods of the texels. The action sequences were
obtained from the Weizmann classification database [1], which contains
10 human actions. We use background subtraction to extract the texels.
In addition to the features used earlier, we use the height and the
width of the texel masks at each frame.

Figure 6. (a) Recognition performance of a NN classifier vs. the
number of training samples used per action type. (b) Confusion matrix;
darker squares indicate higher percentages.

After learning the texel transformations, we use a nearest neighbor
(NN) classifier to recognize a test action sequence, given a set of
training sequences. We define the dissimilarity between two sequences
($S_1$ and $S_2$) as the dynamic time warping (DTW) cost needed to
warp the transformations of $S_1$ into those of $S_2$, where the
dissimilarity between transformations $X_1$ and $X_2$ is defined as
$d(X_1, X_2) = 1 - \frac{\mathrm{trace}(X_1^T X_2)}{\|X_1\|_F \|X_2\|_F}$.
The DTW cost is efficiently computed using dynamic programming.
Figure 6(a) plots the variation of the average recognition rate versus
the number of sequences (per action class) used for training. For each
training sample size, we randomly choose a set of that size from each
action class and perform classification. We repeat this multiple times
and average the recognition rate to obtain the plotted values. As
expected, the performance improves as the number of training samples
increases. More importantly, a simple classifier using only one
training sample per action class achieves a 62% recognition rate,
whereas random chance is 10%. Furthermore, Figure 6(b) shows the
average confusion matrix, which has high diagonal values. Confusion
occurred mainly between similar actions, especially for the ("jump",
"skip") and ("run", "walk") pairs. Better performance is expected when
texels are extracted more reliably and features are more
discriminative of human motion.
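
The recognition pipeline described above (a per-transformation
dissimilarity, a DTW cost between sequences, and a NN classifier) can
be sketched as follows. This is a minimal sketch under stated
assumptions, not the code used in the experiments: the dissimilarity
is assumed to be one minus the normalized trace inner product, the DTW
uses the standard three-way recursion without any window constraint,
and the toy "slow" and "fast" sequences of 2-D rotations are
hypothetical stand-ins for learned texel transformations.

```python
import math

def frobenius_norm(X):
    return math.sqrt(sum(x * x for row in X for x in row))

def transformation_distance(X1, X2):
    """Assumed form: 1 - trace(X1^T X2) / (||X1||_F * ||X2||_F)."""
    inner = sum(a * b for r1, r2 in zip(X1, X2) for a, b in zip(r1, r2))
    return 1.0 - inner / (frobenius_norm(X1) * frobenius_norm(X2))

def dtw_cost(seq_a, seq_b, dist=transformation_distance):
    """Dynamic-programming DTW cost between two sequences of matrices."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in seq_a only
                                    cost[i][j - 1],      # advance in seq_b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

def nearest_neighbor_label(test_seq, training):
    """training: list of (label, sequence) pairs; returns the closest label."""
    return min(training, key=lambda item: dtw_cost(test_seq, item[1]))[0]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Hypothetical toy data: two "actions" that differ in rotation angle.
# The sequences deliberately have different lengths.
training = [
    ("slow", [rotation(0.1)] * 6),
    ("fast", [rotation(0.8)] * 4),
]
test_seq = [rotation(0.75)] * 5
predicted = nearest_neighbor_label(test_seq, training)
```

Because DTW aligns sequences of different lengths, the test sequence
of 5 frames can be compared directly against training sequences of 4
and 6 frames.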

3.2.2 Multiple Swarm Element Sequences 

We apply our algorithm to swarm video sequences compiled from online
sources. We perform model learning and motion discrimination on four
sequences: "birds" [16], "geese", "robot swarm" [2], and "pedestrian"
[8].

Model Learning: The features we used were based on a polar coordinate
system centered at the centroid of each swarm element, where each
angular bin had a fixed width (in rad). For each angular bin, we
extracted two shape features (kurtosis and skew), the mean centroidal
distance of the element boundary, and the mean intensity value. This
yielded a feature vector of size $d = 100$. Setting $\epsilon$ to a
small power of ten, $j_{\max} = 5$, $k_{\max} = 50$, and $W_t = 5$, we
applied Algorithm 2 to learn the spatial layout and dynamics of each
swarm sequence. To evaluate the performance of our method, we
conducted a leave-five-out experiment, where we learn the swarm
dynamics using all the frames except for five. The transformations and
features of the elements in these left-out frames are reconstructed
using the AR model. We repeated this experiment and report the average
normalized residual errors in Table 1 for the four sequences. These
results show that our DS model represents the ground truth data well.
The error was highest for the "pedestrian" sequence due to the
variability in the swarm dynamics and appearance. We also compared
these residual errors to the case when the identity is used instead of
the learned transformations (i.e., no dynamics update). The percentage
ratio of these two errors is shown in parentheses. We conclude that
our learned dynamics substantially improve model fitting.
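
To make the comparison against the identity baseline concrete, the
following sketch computes a normalized residual and its percentage
ratio to the identity-based residual. It assumes a first-order
prediction of the form "current feature vector is approximately the
transformation applied to the previous one", which is an assumption
made here for illustration and not necessarily the exact AR model of
this paper. The helper names and the toy rotation data are
hypothetical.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def normalized_residual(A, f_prev, f_curr):
    """||f_curr - A f_prev|| / ||f_curr|| (assumed first-order prediction)."""
    pred = matvec(A, f_prev)
    num = math.sqrt(sum((c - p) ** 2 for c, p in zip(f_curr, pred)))
    den = math.sqrt(sum(c * c for c in f_curr))
    return num / den

def percentage_ratio(A, f_prev, f_curr):
    """Residual with A as a percentage of the residual with the identity.

    Undefined when f_curr equals f_prev, since the identity residual is zero.
    """
    n = len(f_prev)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return (100.0 * normalized_residual(A, f_prev, f_curr)
            / normalized_residual(identity, f_prev, f_curr))

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Hypothetical example: the true motion is a rotation by 0.5 rad and the
# "learned" transformation is a slightly inaccurate rotation by 0.45 rad.
f_prev = [1.0, 0.0]
f_curr = matvec(rotation(0.5), f_prev)
learned = rotation(0.45)
ratio = percentage_ratio(learned, f_prev, f_curr)
```

A ratio well below 100 means the learned transformation explains the
frame-to-frame change much better than assuming no dynamics, which is
how the parenthesized values in Table 1 are meant to be read.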




Figure 7. "birds", "geese", "robot", and "pedestrian" swarms 

                "birds"      "geese"      "robot"      "pedestrian"
$\epsilon_R$    8.2 (5.4)    10.3 (4.9)   3.5 (4.2)    12.9 (9.5)
$\epsilon_S$    12.5 (6.8)   6.5 (5.8)    11.6 (5.5)   15.8 (11.6)
$\epsilon_T$    18.0 (4.1)   14.1 (7.7)   16.4 (4.4)   23.1 (18.3)



Table 1. Average normalized residual error (as percentage). The 
percentage values in parentheses are the average errors normalized 
by the error incurred when the swarm dynamics are not updated. 



Motion Discrimination: Here, we demonstrate that our method can
discriminate between different motions (i.e., sequences of
transformations) within the same video sequence. After learning the
swarm dynamics, we compute the dissimilarity in dynamics between every
pair of swarm elements. We define the dissimilarity between two
sequences of swarm element transformations ($\mathcal{T}_1$ and
$\mathcal{T}_2$) as the dynamic time warping (DTW) cost needed to warp
the transformations of $\mathcal{T}_1$ into those of $\mathcal{T}_2$
[11]. Such a warping is crucial, since $\mathcal{T}_1$ and
$\mathcal{T}_2$ might have different cardinalities (i.e., swarm
elements do not have to appear in the same number of frames). This DTW
cost is efficiently computed using dynamic programming. However, to
compute this sequence-to-sequence DTW cost, we need to define a
distance between the individual transformations comprising the
sequences. We define the distance between transformations $X_1$ and
$X_2$ as
$d(X_1, X_2) = 1 - \frac{\mathrm{trace}(X_1^T X_2)}{\|X_1\|_F \|X_2\|_F}$.
These DTW costs are employed in spectral clustering to cluster the
swarm elements' dynamics.
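
As an illustration of the last step, the sketch below splits swarm
elements into two groups from a matrix of pairwise DTW costs. It is a
simplified stand-in rather than the clustering code used here: it
handles only two clusters, converts costs to affinities with a
Gaussian kernel whose `scale` is a hypothetical parameter, and finds
the second eigenvector of the normalized affinity matrix by deflated
power iteration instead of calling an eigensolver. The cost matrix is
made up for the example.

```python
import math
import random

def spectral_bipartition(costs, scale=0.5, iters=500, seed=0):
    """Split items into two groups from a symmetric matrix of pairwise costs."""
    n = len(costs)
    # Gaussian affinity: a small DTW cost gives an affinity close to 1.
    W = [[math.exp(-(costs[i][j] ** 2) / (2.0 * scale ** 2)) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in W]
    # Symmetrically normalized affinity M = D^{-1/2} W D^{-1/2}.
    M = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of M (eigenvalue 1) is proportional to sqrt(deg).
    norm = math.sqrt(sum(deg))
    u1 = [math.sqrt(d) / norm for d in deg]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Power iteration on M + I (the shift makes all eigenvalues non-negative).
        v = [v[i] + sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Deflate against u1 so the iteration converges to the second eigenvector.
        proj = sum(a * b for a, b in zip(v, u1))
        v = [a - proj * b for a, b in zip(v, u1)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # The sign pattern of the second eigenvector gives the two clusters.
    return [1 if a >= 0 else 0 for a in v]

# Hypothetical DTW costs: elements 0-2 move alike, elements 3-4 move alike.
costs = [[0.0 if i == j else (0.1 if (i < 3) == (j < 3) else 1.0)
          for j in range(5)] for i in range(5)]
labels = spectral_bipartition(costs)
```

Which group receives label 0 or 1 is arbitrary; only the grouping
itself is meaningful. For more than two clusters one would typically
keep several eigenvectors and run k-means on them.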

The "birds" and "pedestrian" sequences contain more than one
distinguishable motion. Figure 8 illustrates the clustering results
obtained for the "birds" sequence. The extracted swarm elements are
color-coded in the frames according to their distinct motions. In this
sequence, two types of motion co-exist: (i) a "bird-flapping" motion
where the wings oscillate up and down and (ii) a "bird-gliding" motion
where the wings remain relatively still. On the right, Figure 8 shows
the DTW distances computed between all pairs of swarm element
dynamics. We clearly see that type (i) elements undergo quite
different transformations from those of type (ii). Our approach was
able to simultaneously learn the different dynamics in the sequence
and discriminate between them. This cannot be done by DT models such
as [16].




Figure 8. The "birds" swarm example, containing a "bird-flapping" and
a "bird-gliding" motion. The pairwise distances between the learned
transformations are shown on the right.



We also apply our algorithm to "pedestrian" video sequences, where
humans or groups of humans are considered swarm elements. These
sequences were obtained from the UCSD pedestrian traffic database [8].
Figure 9 illustrates the results obtained for a single pedestrian
sequence that exhibits dense swarm activity. The extracted swarm
elements are color-coded in the frames according to their distinctive
dynamics. In this sequence, three types of motion co-exist: (i)
elements (some of which are groups of pedestrians) walk from the top
right corner to the bottom left corner, (ii) other elements move in
the opposite direction, and (iii) one element represents a person
crossing the grass instead of walking along the diagonal path. On the
right, Figure 9 shows the DTW distances computed between all pairs of
swarm elements. We see that the elements of (i) undergo much more
similar transformations than those of (ii)-(iii), which, in turn, have
significantly different dynamics. Some pedestrian segments were not
part of the spatial layout since they were indistinguishable from the
background.




Figure 9. A pedestrian example containing three types of motion. The
extracted swarm elements are color-coded. The pairwise distances
between the learned transformations are shown on the right. Brighter
squares indicate larger distances. Refer to the supplementary material
for these and other video results.

4. Conclusion 

This paper proposes a spatiotemporal model for learning the spatial
layout and dynamics of elements in swarm sequences. It represents a
swarm element's motion as a sequence of linear transformations that
reproduce its properties subject to local stationarity constraints. We
conducted experiments on synthetic and real sequences to demonstrate
the merit of our approach in representing swarm dynamics and in
discriminating between different dynamics. Our future goal is to apply
this method to motion synthesis and recognition.

The support of the Office of Naval Research under grant
N00014-09-1-0017 and the National Science Foundation under grant
IIS 08-12188 is gratefully acknowledged.